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New scalar 
and vector 
elementary 
functions for the 
IBM System/370 



by Ramesh C. Agarwal 
James W. Cooley 
Fred G. Gustavson 
James B. Shearer 
Gordon Slishman 
Bryant Tuckerman 



Algorithms have been developed to compute 
short- and long-precision elementary functions: 
SIN, COS, TAN, COTAM, LOG, LOG10, EXP, 
POWER, SORT, ATAN, ASIN, ACOS, ATAN2, and 
CABS, in scalar (28 functions) and vector (22 
functions) mode. They have been implemented 
as part of the new VS FORTRAN library recently 
announced along with the IBM 3090 Vector 
Facility. These algorithms ere essentially table- 
based algorithms. An important feature of these 
algorithms Is that they produce bitwise-identical 
results on scalar and vector System/370 
machines. Of these, for five functions the 
computed value result is always the correctly 
rounded value of the inffniteiirecislon result For 
the rest of the functions, the value returned is 
one of the two floating-point neighbors 
bordering the infinite-precision result, which 
implies exact results if they are machine- 
representable. For the five correctly rounded 
elementary functions, scalar and vector 
algorithms are designed independently so as to 
optimize performance in each case. For other 
functions, the bitwise-kienf ical consist iwda 
to algorithms which compromise between scalar 

•Copyrteht 1986 by International Business Machines Corporation. 
Copying in printed form Tor private use is permitted 
oaymcnt or royalty pmvided that 0) each n^rt^ctiou is done 
without alteration and (1) the Journal reference and IBM copyright 
notice are included an the fieu page. The tide nd abstract, but oo 
other portion*, of this paper may be copied o^^lxitaj royalty 
fixe whhom further permission by ootttputer-baaed and other 
infofmation^ervioe systems. Ptfmiaoa 10 rtpubhsh any other 
portion of this paper must be obtained from the Editor. 



and vector performance. We have been able to 
design algorithms where this compromise is 
minimal and thus achieve very good 
performance on both scalar and vector 
implementations. For our test measurements on 
high-end System/370 machines, our scalar 
functions are always fester (sometimes by as 
much as a factor of 2-5) as compared to the otd 
VS FORTRAN library. Our vector functions are 
usually two to three times faster than our scalar 
functions. 



1. Introduction 

* Hhtory of the FORTRAN intrinsic functions 
When FORTRAN became widely available, inert were 
already many subroutines [I] Tor elementary functions in the 
SHARE (IBM users 1 organisation) library. In addition, many 
computer installations had their own libraries, obtained from 
various sources. These were written for assembly language 
programs but could be used with FORTRAN. One compiled 
and ran in separate sessions and in the run session one could 
use subroutines of one's own choosing. 

U soon became obvious that a standard u bcsT set of 
subroutines should be included in aH of IBM's versions of 
FORTRAN. The original FORTRAN library thai came with 
704/709/7090 FORTRAN was writKn by IBM. h was 
"adequate," but many programmers continued writiiuj new 
and improved versions of various routines- Eventually most 
SHARE installations converged on a set of improved 
versions (mainly by W. Kaban at the University of Toronto, 
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W. J. Cody fil the Argon ne National Laboratory, and H 
Kuki at the University of Chicago) and distributed them via 
the SHARE library. The new routines were good enough for 
IBM to take them over as the official versions. 

When work began on System/360, Kuki was contracted to 
write the new library (since he had done ihe bulk of the 
then-current version). The first set of routines was available 
with the initial releases of System/360 FORTRAN. See [2-4] 
for general discussions of the problem and objectives of 
computing elementary functions. 

In his presentation [5] at the Mini'Symposium on 
Elementary Functions at the SIAM National Meeting in 
Seanle in July 1984, tilled "Software for the Elementary 
Functions." Cody described these subroutines; 

The result was an excellent library, but still not the best that 
could be done, even by standards of that time. Kvki was 
severely handicapped by a requirement that each program 
had to produce identical numerical results on ail machines in 
the family. Speed and space restrictions imposed by the 
smaller machines in the line, and variations in the machine 
architectures, meant that he could not produce programs that 
were optima! for any one of the machines- For example, 
because of timing considerations on the smaller machines, he 
used careful argument reduction on single-precision (32-bit) 
programs but not double-precision (64~bh) ones, and he 
minimized the use of double-precLtfonjlnating point division. 
This penalized those doing scientific computations on larger 
machines to meet restrictions imposed by smaller machines 
intended primarily for non-scientific use. Because of ihe 
inadequacy of the double-precision programs, many 
installations reverted to a non-standard library containing 
replacement programs written by Kuki or the group at 
Argonne. 

The first set of modifications to these routines was made at 
the time of the "Improved Roaring Point En gineeri ng 
ChanGe" (IFPEC), which fixed some of the deficiencies (most 
importantly, the lack of a guard dipt in long floating-point 
arithmetic) in the original floating-point architecture. Kuki 
made modifications to some of the routines 10 exploit the 
corrections (e.g., the fact that the Halve instructions now 
produced a normalized result), but the changes were not 
extensive. 

When the Systctn/360 Model 85 was in design [with 
extended 0 29-bit) precision]. Kuki was asked (see [61) to 
revise the library to add extendr^^ rccision functions. That 
version of the library was announced with FORTRAN H 
Extended about 1971 and has been the "standard" library 
ever since [6]. 

At the SHARE-54 meeting in Anaheim, California, in 
March i960. Wang [7] (sec also [8, 9]) described some 
madequaeies in the IBM Systcm/360 FORTRAN IV 



libraries and made suggestions for their improvement. He 
also described some results with a new set of subroutines 
which overcame these difficulties. The SHARE FORTRAN 
Project submitted a requirement to IBM to provide the 
Wang routines as an alternate to the Standard product 
library; IBM accepted the requirement and shipped the 
routines with VS FORTRAN Version l Release 2 as 
VALTUB in 1982, The routines did not meet with 
enormous acceptance due to the trigonometric functions 
being 20-4055 slower than the standard library; many users 
with small arguments stayed with the original versions, even 
though for targe arguments the standard library was less 
accurate* 

Scarborough [10] wrote versions of the library routines as 
part of the Optimization Enhancement IUP (FORTRAN Q), 
He made no changes to the Kuki algorithm*, but changed 
the programs to use shorter execution paths with less 
preparation for error cases until the errors actually occurred. 
He also changed the conversion routines slightly, again to 
gain Speed but with no perceptible change in accuracy. 

In the earlier computers, memory space was quite small 
and computer speed was orders of magnitude higher than 
many users were accustomed to. Compactness of the 
program and accuracy were therefore the most important 
considerations. As succeeding generations of machines and 
FORTRAN revisions were produced, the practice was to be 
able to assure users that the new versions would yield the 
same results as the old ones (see [10], for example), Fn many 
cases, certain FORTRAN programs were established as the 
tests of acceptability of plans and designs. Therefore, and 
this may be folklore, succeeding generations of users 
requested that new subroutines give exactly the same results 
as old ones even when the old ones were inaccurate, 

■ The need for new functions 

After some time, the requirement of complete numeric 
compatibility became an undesirable constraint. A 
compelling reason for a break whh this rule came with the 
advent of IBM's new Vector Facility. We thought that a 
program using the new vector instructions to mimic the old 
algorithms in such a way as to produce exactly the same 
results would result in inefficient vector subroutines. 
Furthermore, since a good algorithm for a scalar subroutine 
is, in general, not a good algorithm for a vector subroutine, 
and since the previous algorithms used divide instructions, it 
was necessary for the sake of speed to develop new 
algorithms and, therefore, forego the requirement for strict 
compatibility with the old subroutines. However, we agreed 
to provide the user the ability to get the same results from 
the subroutines whether or not be used the Vector Facility: 
thus, our task was to produce a companion set of scalar 
vcTsians of all subroutines which produced exactly the same 
results as the vector versions. Our new set of scalar 
elementary-function subroutines yield higher speed and 
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greater amiracy in a wide range of System/370 machines. In 
producing this new library, the new technology afforded us a 
number of great advantages: 

• With more memory available, we could make the 
subroutines longer and use tables. 



• Program development toots such as interactive computing 
and graphics were available. 

• Computer speed and availability permitted extensive 
testing* including, in some cases, testing every possible 
argument. 

• Software development tools produced 

To achieve the high performance goal set for the project and 
to obtain the greatest advantages from the facilities 
available, a number of programs were developed. 

The Vector Facility simulator 

In order to write and test programs long before the vector 
machine was available, Tuckcrman wrote a functional 
simulator permitting one to run and trace modest-sized 
vector programs on the available System/370 machines- The 
simulator was very easy to use and its tracing facilities were 
of enormous help in debugging programs. Its use enabled us 
to complete most of the vector subroutines long before the 
Model 3090 with its Vector Facility was available. 

The ulp ploi 

Starting with a demonstration by Moler [LI] showing how 
graphical displays of errors in ulns {units in or of the last 
place) can reveal important characteristics of error 
distributions, Agarwal developed such programs for use on 
our graphics systems. An ulp is the distance between the two 
nearest floating-point numbers of the actual result. These 
programs tested elemenUry-Amction algorithms and 
presented a graphical picture of the distribution of errors. 
They evolved into a very effective interactive tool for 
analyzing errors and suggesting strategies for the 
improvement and correction of our algorithms. Samples of 
output plots from this program are shown in Figures I 
through 9, and they arc discussed in Sections 2 and 4. 

Polynomial approximation 

Polynomial approximation subroutines were available from 
SL-MATM [12], the mathematical subroutine library. These 
were revised to compute in extended precision in order to 
give the high accuracy needed for the double-precision 
routines. Since much experimentation was needed to get 
last-bit accuracy, the approximation subroutines were 
incorporated into interactive programs which computed 
error distributions and statistics. They also produced 
machine-readable assembly language and/or FORTRAN 
statements containing the exact hexaO^cimai representations 
of the approximation coemcicnts. 

- Why do we want correctly rounded results? 
A great deal of satisfaction was obtained from the Tact that . 
frve of the intrinsic functions reported here always deliver 
correctly rounded results; these are SQRT, DSQRT. CABS, 
CDaBS, and EXP. One important aspect of this is that 
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co needy rounded results were obtained with surprisingly 
blue sacrifice in performance. A second, which may nave 
more far-reaching consequences. Is that the requirement Tor 
future compatibility docs not compel one to use the same 
algorithm when a new machine architecture would make a 
different algorithm more efficient. 

* Intangibles 

Use of the one-ulp criterion facilitates the preservation of 
many desired mathematical properties of elementary 
functions such as monotonicity, symmetry* and important 
identities. Furthermore when the result is an exactly 
reprcsentable machine value the one-ulp criterion 
guarantees that this result win be obtained. In most cases, 
the very nature of our algorithm design guaranteed correct 
symmetry properties. Special attention was paid to the 
examination of table boundaries in order to ensure that 
monotonicity was not violated. Of course, mono tonicity i$ 
not violated for the routines thai always deliver correctly 
rounded results. For other subroutines, large numbers of 
arguments were tested and no violation of monotonicity was 
found. 

Another desirable property is the preservation of the 
quadrants in computing the inverse trigonometric functions. 
Surprisingly, this is an example of a situation where the 
desire to obtain correctly rounded results was in conflict with 
the preservation of a mathematical property. For example, 
DATAN2(Y.X) was made to lie within the quadrant 
indicated by (X, Y). This means that at times we deliberately 
incorrectly round the result In one case, if X is a very small 
negative number and Kis a very large positive number, the 
exact result is slightly greater than t/2. The nearest machine 
number happens to lie in the first quadrant, but we produce 
the rounded-up value which is in the second quadrant 

• Performance of the new programs 
Speed 

At first glance, it would seem that the use of a vector 
algorithm would make it necessary to use the same 
algorithm for a whole vector of arguments, while a scalar 
program would test each argument and branch to a routine 
employing the best method for thai argument Therefore, 
one mighi expect the scalar-vector compatibility requirement 
to cause a loss of performance. By applying special strategies, 
as described in Section 3, wc found that we could keep the 
performance loss of the scalar subroutines minimal. 

Accuracy 

For our algorithms, the use of "tuned" tables and other 
techniques to be discussed below always gives errors of less 
than one ulp- To be more precise, the result is always one of 
the two machine numbers bordering the infinite-precision 
result Extensive testing snowed that correctly rounded 
results are obtained for more than 95% of the arguments. If 
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any computed value is subsequently found to violate the 
one-ulp criterion, it will be treated as a program error and 
the program will be corrected. Our programs were, in many 
cases, subjected 10 mathematical analysis which ensured 
accuracy provided that certain conditions were fulfilled. 
These conditions, in many cases, were tested numerically to 
ensure thai no violations occurred, and so correct results 
within one ulp arc virtually certain. 

Robustness 

The old VS FORTRAN library was written with the 
assumption that little or no effort should be spent on 
arguments which arc unnormauzed or subnormal (below the 
smallest number which can be represented in normalized 
floating-point format) or which, in some way, may be 
considered pathological. The present programs give results 
for both subnormal and unnormalized arguments, where 
possible. 

For arguments near singularities of the tangent function, 
the old intrinsic TAN function often gave error returns 
where, in fact, the correct results were machine-reprcscniaLle 
numbers. It was shown for the present work that for all our 
intrinsic functions except COTAN near zero, there can be no 
raachine-reprcscntable argument so dose to a singularity 
that the result is not a macnine-rcprcsentable number. The 
present scalar subroutines were written so as to give result* 
correct to within one ulp for all arguments for which 
normalized machme-re|*esentable results exist. 

2. Accuracy— The ulp concept 

The accuracy of these new programs is described here in 
terms of units in the last place, or "ulps." and is shown in 
Figs, 1, 4. and 7 and Table 1. 
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ACOS CIRC(0,PI> 
DACOS CIRC (0, PI) 

ASIN CIRC(-PI/2 r Fr/2> 
DASIM CIRC(-Pl/2rPI/2) 

ATAN TAN(-PI/2,PI/2] 
DATAN TAN(-PI/2,Pl/2) 
AT AN 2 P0LARM6**-16, 16**16) 
DATAN2 POLAR(16**-l6,16**16) 
COS LINEAR<-PI,PX) 
COS L0G(2**-18*PX,2**18*FJ) 
DCOS LlNEAR<-Pi;,PI) 
DCOS LOG(2* p -50*PI,2*»50*PI) 
SIN I.INEAJU-PI, PI) 
SIN LOG(2^»-1fl"PI.2«*ie«Pl) 
D51N LINEAR(-PIrPI) 
DSIK LOG(2**-50*PI,2**50«PJ) 
TAN LINEAR ( -PI/2 , PI/2 ) 
TAN L0G(2*"~1B*PI,2«*1B*Pll 
DTAN LINBAR{-PI/2.FI/2) 
DTAN LOG(2»*-39*PI,2"39*PIl 
COT AN LIKEAR(-PI/2,PI/2) 
COT AN LOG(2**-18*PI,2**1fi»Pl) 
DCOTAN LINEARI»PI/2.PI/21 
DCOTAN L0C(2*"-39*PI,2"39*PI) 
EXP LINEAR (-100, TOO) 
EXP LINEAR (-16,16) 
DEXP LlHEAR(-10Q,100) 
DEXP LINEAR (-16, 16) 
AUX5 L0G<l6**-65, 16**63) 
DLOG LOG(16**-65,16»*63| 
ALOG10 LOG<l6**-€5, 16**63) 
DLOG10 LOG< 16*"-65 , 16**63 1 
SQRT LOG("l6**-65, 16**63) 
DSQRT LOG(16"-65,16"»63) 
X**Y LlM£AR(.1 r lO)"»60.1 
X**Y LOG(16**-6S, 16**63) 7 
DX«*Y LIN£AR(.1,10)*-60.1 
DX**V LOG06*»-G5, 16*»63)**.7 

CABS P0LAR(16**-16, 16*""16> 
CPABS POLAR(1G**-l6, 16** 16) 
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* Floating-point number systems 

We assume that the computer arithmetic is being carried out 
in a given floating-point number system. Let b be the base, k 
be the number of base-/) digits in the fraction, and /, w be the 
lower and upper limits of the exponent e af b. Then a 
floatingpoint number X can be represented by j, <?, <v ■ 
Qk7 where s = ± , / <= e < u, 0 < a, < and the associ ated 
value is 

where 

k 

0Sm=S ajb~ i < t. 

A nonaero rToatinR-point number is normalized if the leading 
digit a, of the fraction is nonzero. The range of positive 
normalized floating-point numbers * is A ±X<b.^ 
In the IBM System/370 series of computers, b - 16; k - 6, 
14, or 28 for the short-, long-, or extended-precision formats; 
and/ = -^64,1^ +63. 



• The ulp concept 

Consider a positive normalized floating-point number X in 
such a system. Then a unit in the last place Of A" is defined as 
the difference between X and the neat larger floating-point 
number (or between X and b" h" X is the largest floating- 
point number). In me Systcni/370 representation, if 

X^m-I6 e , 
where 

1/16 ^ /m < 1, 

then an ulp of X in that system is 

ulp (X)= .0 01.16'=. 10 (MG^** 1 . 

For example anuip of .765432- 16 1 is .000001 . 16' = 
.100000- 16**. The digits of the fraction are hexadecimal 
digits (a 1, -•••9. A,B, -.F). 

If x is a positive reel number (of infinite precision) lying in 
the range of floating-point numbers, then an ulp of *, from 
the standpoint of the floating-point number system, is 
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FORTRAN (Kuki) DSIN. Plat of errors in iht FORTRAN DSIN 
subroutine In ulps for evenly (ag-disiribuied random m^um mis In die 
interval (2~ 50 ir, 2^). Note ibtt error* are given on a logttilhmic 
scale due to their large sizt . 



defined as an ulp of the largest floating-point number* 
which does not exceed K i-*-, where X is derived from x by 
"chopping." 

It is convenient to define an ulp of a negative floating- 
point or real number as the negative of an ulp of its absolute 
value. An ulp of zero is undefined. 

Ifwe are given a computer program which defines a 
floatingpoint-valued function Y-= F(X) of a floanngrpoin*. 
valued argument X, and which is intended lo approximate a 
given mathematical function y **f{x) (which cannot in 
general be realized exactly) at floating-point arguments x = 
X then the absolute (i,e„ not relative) signed error in Fat a 
given floating-point argument X U defined a$ error (F,f % X) 
= F(X) "J{X) Y — V = computed value minus true value. 
It is convenient to express this error in terms of ulps, i.e., 

ulp error (F, / X) - error (F, f. JO/uIp (K yl 

Here ulp {Y % y) is defined as the common value of ulp ( Y) 
and ulp (y), in the usual case when these values are equal. 
However, in the rare cases when they aie unequal, which is 
when y and y have different exponents, then ulp ( K y) is 
denned as the one or them which has the smaller absolute 
value. Forexaraple, if y=. 100000.16' + 1- ulp (y), _w here 
0 £ c< Vz, then ulp (y) « 16~ 3 ; aitdir Y = .FFFFFC, then 
ulp {Y) ~ 16^. Then ulp( Y, y) = 16™*, and ulp error 
(F,/ JO = 16* + 4, correctly indicating a poor 
approximation, rather than c + 4/ 16, which, would 
erroneously indicate a good approximation- A good 
approximation, of course, is r = JOOOCJO- lo~ l - 1, for 
which ulp (y, y) - ulp error (F,f*X) = c. For 
brevity we write this quantity as efu. 
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Abtpiuto xzLixb of argument 



j Wang** DSIN . Ptol of errors in Wang*!* DSIN subroutine in ulps 
! Tor evenly log-diitribuicd random arguments in Ihe interval (2"^, 
l^n). Note that errore ore given on a lognrilhmie scale due to their 
targe sze. 



In the new programs, an attempt has been made to 
minimize the maximum ulp errors. The results are 
summarized below and in the section on accuracy statistics. 

Our SQRT. DSQRT, and EXP functions satisfy \cfu I < 
0.5; they have best-possible rounding. This is sometimes 
called the "half-ulp" or "one-point- criterion. 
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Yorkiown X**Y. Plal of errora in the YorklOw* «ingle-prwu>ion 
implicit X**Y subprogram in ulps far an evenly du,iribul*J random 
argvimuni X in the iniervjj < .01, 10.) and Y = 60. 1 . 




Argument 



FORTRAN (Kwki) X*»Y. Plot of errors in the FORTRAN unglc- 
prociHiun Implicit X"*Y subprogram in ulps for unevenly di^iribuicd 
rahoornarBumcmXlnihc ijiicrvaU.OI, 10.) and Y - 60-1- Naie That 
errors arc given on a logarithmic scale due 10 their large sile. 
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Our CABS and CDABS functions satisfy I e/u\ £ 0.5 (this 
can also be called a half-ulp criterion). They have best- 
possible rounding, except thai unavoidably there arc cases 
when | e/u\ = 0.5, m which case rl would be equally correct 
to round downward or upward; we choose to round upward. 
This is consistent with ihc System/370 definition of 
rounding. An example in short precision is the following. Lei 
rV e .4- I6. b - 8. - .3FFF F8-16*, and let W - X + f Y = 3 N 
+ 4JV/ = .BFFFE8-16* + JTTTE0-I6 6 /. Then * = abs(W) 



5/V «= . 1 3FFFD8 ■ \6 7 He$ exactly midway between 
. 1 3FFFD - 1 6 T and . 13FFFE • 1 6 7 . Either of those numbers 
can be regarded as a rounding Z of the true value, and in 
either case \e/u\ = 0.5. One of them (the larger) is returned 
by CABS, and similarly for CDABS. 

This rounding ambiguity can also occur for x*\ for an 
example in short precision. 258- 3 = (.102000' 16V - 
.1060CO8 - 16', which lies exactly midway between .1060C0* 
16' and .10600- 16\ However, correct rounding is not 
always achieved for x r . The rounding ambiguity cannOT 
occur for any other of our functions. (The only rational 
solutions x, y of log I0 x = y arc for y a nonnegative integer. 
The only algebraic solutions y of our remaining^*) = y 
are for x = 0 in sin *, cos jr. tan jr, and for y = 0 in their 
inverse functions [ I3J. Except for EXP, correct rounding is 
not always achieved for our other functions. 

All of our functions are believed to satisfy | eju \ < 1.0; 
equivalently. a computed function value Y, if not exactly 
equal to the true value j\ is one of the floating-point 
numbers just above or just below y. We have called this the 
"one-ulp" or "two-point" criterion. An effort has been made 
to make the errors | e/u | substantially less than 1 .0, and the 
results can be judged from Table 1 and the ulp plots (Figs. 1 , 
4, and 7). Note that errors in | e/u | of up to 0.5 arc an 
unavoidable result of any rounding and we have endeavored 
to keep the actual errors as nearly within this bound as is 
practical. 

Additionally, the following desirable properties are 
attained: 

1 . Special case values which should be exact floating-point 
numbers are so in fact, eg. EXP(0.) I SQRTC25) - 
.5. 16." 25 « 2. (This is a consequence of the one-ulp 
criterion.) 

2. The F(X) are strictly even or odd functions, i.e^ F\— X) 

F(X) or F(-X) = -f\X\ respecriveJy, for every 
floatine-pomt number X, if the underlying function J{X) 
is even Or odd like cosine or sine, respectively- Because of 
the previous definition of an ulp of a oegath/c number as 
the negative of an ulp of its absolute value, ulp error 
(Fj. X) - ulp error {FJ. -X) whenever/ and Fare 
either even or odd. (That definition also makes the 
interpretation of ulp plots of oscillatory functions easier 
man if an ulp were always considered to be positive, in 
which case there would be a near-mirroring of the regions 
Y £ 0 and Y < 0 onto each other.) 
3. The functions are believed to be monotomc where 
appropriate although this is not guaranteed, 

■ Ulp plots 

The "ulp plots" 1 shown later in this paper are examples of The 
output of a very useful tool which we developed on hearing 
from Pan! [14] of similar outputs produced by Molcr [I I). 
A specified set of arguments, chosen at random or 
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linearly, based on a given distribution (eg,, linear or 
logarithmic) over a given range, is supplied 10 two functions, 
typically one which is to be tested and another which is to be 
used as a reference (c-g., an extended-precision version). The 
differences are computed in terms of ulps; ihcy can be 
plotted immediately on a high-resolution graphics terminal 
and can also be plotted immediately on hard copy if desired. 
The horizontal axis shows the range of arguments; the 
vertical axis shows the error in ulps, plotted linearly if hs 
range is small, logarithmically if its range is large. 

The ulp plots of our functions are rather featureless, 
showing a random scattering of points, mostly within ±0.5 
ulp. because of the accuracy attained. But during 
development the ulp plots were very revealing of specific 
numerical difficulties, much more so than mere statistical 
summaries would have been. Some of these revelations can 
be seen in the plois of functions from the older libraries. 

For example, the plot (Fig. 5) of D5IN from the VS 
FORTRAN library versus QSIN shows, for one thing, the 
loss of ulp precision for large arguments, starting not far 
above x* » 1, due to imperfect argument reduction. Tt also 
shows, surprisingly, "spite"* of errors of up to 12 ulps (see 
also Figs. 2. 3. and 6). even for very small arguments [15]. 
where the approximation sin x x should be accurate to 
well under 0*5 ulp. These errors resulted from a 
multiplication of all arguments by 4/ir during range 
reduction, followed eventually by a polynomial evaluation 
which in effect multiplied the reduced arguments by 
Whenever the fraction of a floating-point argument lies 
between and 1, the first multiplication yields a fraction 
between I and 4/* and an exponent increased by 1. This 
results in a "chopping" loss of nearly a digit of precision, 
which is, of course, not restored by the second 
multiplication. Both of these sources of emirs have been 
eliminated in our programs- 

The ulp plots and statistics of our functions were made tor 
10000 or more random arguments, and show no errors as 
large as one ulp. The value of 1O000 was chosen because it 
nicely exhibits the salient features of our functions. In testing 
our functions, we ran sample sues well into the tens of 
millions- The single-precision functions of one argument 
were nearly exhaustively tested. This is how we know that 
the Short EXP algorithm delivers the correctly rounded 
result for all arguments. However, it was not feasible to test a 
function for all long-precision arguments, nor for all pairs of 
short-precision arguments (including complex arguments). 
Our belief that it is possible to produce a library ^lisfying 
the one-ulp criterion throughout has been buttressed by 
analysis, by extra-dense ulp plots in some narrow critical 
regions such as across boundaries of table intervals and 
hexadecimal exponent boundaries, and by extensive 
numerical testing, but tt has not been proven in all cases. 
Any observed violations will be regarded as program errors 
and will be corrected. 
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Wang's X*«X Plot of errors in W&ng's singJc-precisian implicit* 
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The collection of accuracy statistics for each orthe 2$ 
functions, for samples of size 10000 over suitable 
distributions of arguments, is described in the subsection 
"Near-correct accuracy.** Table 1 shows the number of 
correctly rounded arguments and the maximum ulp errors 
observed. Tables 2 and 3 show the corresponding statistics 
for the VFORTLTB and VALTUB (Wang) routines. These 
tables are described in detail in Section 4. 

3. Programming strategy 

■ Irumduction 

In this section we describe the main concepts and 
programming methodology that constitute the theory and 
implementation of our new set of elementary function 
programs. 

Our original task was to produce bitwise-corn patihle 
algorithms in both scalar and vector for the Model 3090: 
that is. the vector and scalar versions should produce 
identical results for all legal arguments. The divide 
instruction on the 3090 is a relatively slow instruction 
compared to mulriply, add, and subtract instructions for the 
vector and scalar. Therefore, almost all of our algorithms 
avoid divide instructions. Previous scalar routines have not 
handled unnormalized arguments acceptably. Wc decided to 
handle unnormalized argument* in scalar in order to make 
our functions robust. In the vector hardware, multiplication 
by an unnormalized nonzero argument produces an 
uxinorrnalized-operand exception. The scalar hardware does 
not have ibis exception. Therefore, we decided not to handle 
unnormalized nonzero arguments in the vector elementary 
functions. These arguments may produce unpredictable 
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Table 2 Version 1 Accuracy SUUSlics: pc^nt cotnxUy roondcd, avenjgc ccnw. Wlh percent error bound. 



FUNCTION DISTRIBUTION (RANGE) 



p 


EAVE 


E99* 


TRIALS 




u • **** 


1 .26 


10000 




0.63 


1 .47 


10000 


£fi Q"7 


0 30 


0* 96 


10000 


56 . 68 




1.13 


1 oooo 




0.68 


3 .76 


10000 


38 82 


o" 69 


2 .03 


10000 


A6 . 2? 


o l 99 




10000 


4 5. 29 


0 77 


4 !s5 


ioooo 


JD< ?? 


0 .74 


1 .78 


10000 


46 .10 


0-60 


1 !fi6 


10000 


1 2 > 86 


10-. 93 


146. 39 


1000O 


33*8? 


. 79E+-14 


. 17E+16 


10000 


36.17 


0,7S 


1 .83 


10000 


27*06 


1 !04 


10. 19 


10QOU 


1 5 ♦ 91 


9.76 


109-42 


10000 


3 *. 87 


.79E+14 


.17E+16 


10000 


37 . 64 


0.84 


5.33 


10000 


35 . 37 


0. 98 


7.91 


9696 




18.37 


129.23 


10000 


14.67 


.24E+12 


.32E+13 


10000 


38.82 


0.84 


5.79 


10000 


34.93 


1 .04 


6.61 


9709 




24.25 


163.46 


10000 


10.53 


.30E+12 


.33E+13 


10000 


98- A3 


0.25 


0.51 


10000 


97. B6 


0.25 


0.52 


10000 


63.76 


0.41 


1.10 


ioooo 


6ft. 27 


0.40 


1 .05 


10000 


74.94 


0.34 


0.99 


10000 


57.06 


0.50 


1.47 


10000 


67.80 


0.57 


4,75 


10000 


58.81 


o.as 


8.05 


10000 


97.21 


0.25 


0.54 


ioooo 


100.00 


0.25 


0.50 


ioooo 


l .29 


124.57 


691-65 


ioooo 


3.3S 


37.14 


19^.40 


10000 


0.61 


154.66 


674.08 


10000 


2.20 


53.99 


282.56 


ioooo 


37.41 


0.77 


2.12 


ioooo 


39. ie 


0.7B 


2.13 


10000 



acos chuc(o,pi> 

DACOS CIRC(0,P1) 

ASIM CIRC<-PI/2,PI/2> 
PAS IN crRCf-PX/2, PI/2) 

ATAN TAN (-P1/2, PI/2) 
OATAN TKt1(-*l/2,PI/2) 
ATAN 2 POLAR(l6*"-16, 16**16) 
D ATAN 2 POLAR [16"»-16 4 16"»16> 
COS U NEAR ( -PI. PI) 
COS L0G(2**-18*PI,2«*18*PI) 
DCOS LINEAR I, PI) 
□COS LOG(2«-50»PI,2"50»PI) 
SIN LINEAR ( -PI, PI) 
6IN LCG(2**-18«PI,2*"1B*PI) 
DSTN LINEAR PI) 
D5IN L05(2»*-SD*PI,2«SO*PI) 
TAN LltTEAR(- PI/2, P 1/2) 
TAN L0G(2**-18*PI,2-*1H*PT) 
OTAN LINEAR(-PI/2,PX/2) 
DTAN LOG(2**-39*Pl,2»*39*PD 
COTAN LIMEAR< -PI/2. PI/2) 
COTAN L0G<2*»-18»PI,2*"IB»M) 
DCOTAN LINEAR<-PI/2,Pi/2) 
DCOTAN LOG(2**-39*Pl # 2-*39»PI) 
EXP LINEAR(-100. 100) 
EXP LINEARt- 16 , 1 6J 
DEXP LINEAR (-100. 100) 
pEXP LINEAR { - 1 6 , 1 6) 
ALOG LUfc(16**-6S, 16»*63) 
DLOG LOGt16**-6S, 16**63) 
ALOG10 LOG(l6"-65, 16**63) 
DLOC10 LOG(l6""-65. lh--63) 
SQRT LOGn6*»-65, 16**63) 
DSQRT LOG(16**-6S. 16*"63) 
X*»Y. LINEAR (.1,10) "60 _ 1 
X**Y LOG(16'*-65, 16"63) ♦*.? 
DX**X LrNEAR(.l,lO)**G0.1 
DX**Y LOG(16**-65,16**63)»«.7 

CABS POLAR (16**-1&, 16"«16) 
CDABS P0LAR(16**-16, 16»*16> 
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results. Al! arguments with zero fraction are correctly 
handled, regardless of whether the exponcm is zero. Thus we 
claim bitwise compatibility between the new scalar and 
vector routines for all arguments except nonzero 
unnormalized operands. 

Under this requirement we wanted our function* to be as 
accurate as and to execute faster than the current VS 
FORTRAN product. The very stringent requirement of 
bitwise compatibility restricted the speed of both the vector 
and scalar algorithms. Our requirement was to weigh scalar 
and vector algorithms equally in our attempt to meet the 
above goals- Some remarks on the way we handled vector/ 
scalar trade-ofls now fallow. In our vector codes, only those 
arguments are handled which would have been processed in 
the main path(s) of the scalar code All arguments which 
require special processing are handled by the scalar code, 
cither by branching to the appropriate scalar function or by 
duplicating some of the scalar code as part of the vector 
function. For trigonometric functions, more than S0% of the 
arguments between zero and */2 are handled in the vector 



mode. For other functions, almost all of the arguments in a 
certain distribution are handled m the vector mode. Some of 
chedementary-function routines always produce correctly 
rounded results. For these routines, different algorithms can 
be used in the scalar and the vector mode; while preserving, 
the bitwise compatibility feature of the library. This fact was 
used to redesign the vector algorithms for some of the 
correctly rounded vector functions so that they could be 
independently optimized for better performance on the 
vector hardware. 

In this section, we cover the fallowing topics: Tuckerman 
rounding (see 1 16(a), p.10) and [16(b). P> 14». table-based 
approach, near-correct accuracy, last-track programming, 
robustness, and error handling. Tuckerman rounding is a 
simple multiplicative algebraic/numeric condition that 
allows us to produce correctly rounded results for the square 
root and CABS functions. Hull [17] has also described a 
condition, but bis condition requires higher-precision 
arithmetic Kahan [ 1 8) has also discovered a similar 
condition; his result uses a divide instruction- Tuctennan's 
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Table 3 VALTUB accuracy sLilittics (for 10 000 trials); oereenl correctly rounded, average error, 99th percentile error bwntt 



FUNCTION DISTRIBUTION (BANCO 


P 


EAVE 


£99 X 


DCOS LINEAR I -PI, PI) 


2). 51 


1 .40 


4.16 


DCOS LOG(2**-50*PI, 2**50*PI) 


11 .69 


.46B+13 


. 1 SE+09 


DSIH LINEAR PI) 




1 .3fi 


4.16 


DSIW LOGf2**-50*PI„2*»S0»PI> 
DTAN LINEAR (-FT/2, PI/2) 


6.57 


-7 3E+07 


. 1 


36.68 


1.02 


7.57 


DTAN L,OGC2«--39*PI.2**a9*PI) 


24.21 


.16E+05 


. 1 7E+06 


DCOTAN LINEAR (-FI/2, PI/2) 


30.25 


1.01 


4.87 


DCOTAN LOG(2*--39*PI,2«»39*Fl) 


19.43 


. 17E+0D 


. 1SE+06 


EXP LINEARl-100,100) 


98.43 


0.25 


0.51 


FXP LtN£Aft(-16, 16) 


97. 66 


0.25 


a. 52 


DEXf LINEAR(-l00,1u0] 


63.76 


0.4 1 


l .10 


DEXP LINEARC-16, 1€) 


64.27 


0.40 


1 .05 


X**Y LINEAR (,1 f 101**60. 1 


56,45 


o.ei 


5-0T 


X**Y LOG(16**-65, 16»"631".7 


9B_9A 


0.25 


0.50 


DX"Y LINEAR ( . 1 1 10) ••SO . 1 


27.27 


1 _4G 


6.82 


DX**Y LQG(T6«*-65, 16»*63)**.7 


67.4S 


0.39 


1 .32 



condition is of historic significance, as »U use allowed us to 
produce IBM's first elementary fan crion that delivered 
correctly rounded result* for all arguments. 

The lable-based approach follows naturally from a 
fundamental idea: Let an elementary function be expressed 
as the sum of some exact value and a small correction, cf. 
Kuki and Ascoly 161 FullertOn [19). We developed several 
ways to find such an exact value. For example, w adopted a 
strategy of sharing one large (about 256 entries) table over 14 
routines (TAN, COT, ATAN, ATAN2) in both short and 
long, vector and scalar. (We did not vectorise short-precision 
TAN and COT.) We originally coded SQRT and CABS 
using this technique (eight routines sharing a table or size 
192 entries of 12 bytes per entry). In our TAN/COT/ATAN/ 
ATAN2 scalar/vector programs, we use a common table 
where, for each table entry, an Xq is chosen near the middle 
of the table interval such that tan (jr^ is an exact short word 
and Xn is siored as a double word followed by its short-word 
continuation (hexadecimal digits 1 5-20). Actually. jf 0 is a 
transcendental number which we approximate to 20 hex 
digits. In our approach, with the same a mourn of table 
storage ( 1 6 bytes per table entry) we are able to get up to 20- 
digit precision. The only disadvantage of this approach over 
the next one is thai it requires an extra addition (to account 
for the continuation of jr^). 

We heard about another way, called the Accurate Table 
Method, from the work of Gal 120]. Kalian has also 
informed us that Miller [21] wrote about the extra-accurate 
tabic idea in 1958. Gal's idea helps in eliminating the need 
to store the continuation (beyond the 14* digit) of the 
function value. Typically, the last digits of (a double 
word) are chosen so that typically hex digits 15-17 otjlxj 
are zero. Thus, with 16 bytes of storage per table entry, 
approximately 1 7-digit precision can be obtained. This 
method saves one floating-point addition over the alternative 
approach above that develops an exact value. There is 
another way to save arithmetic; for example, the table for 



DEXP is an accurate table with a constant built into h so 
that an extra floating-point add can be saved. For other 
codes (DSIN/DCOS, DASIN/DACOS) we developed a new 
concept of built-in rounding. Combining the use of this type 
of u accurate table" with Gal's method saves an extra 
addition. 

We usually get correctly rounded results by keeping the 
correction term sufficiently small. When the correction term 
is then added to the exact values a right shift occurs, usually 
producing the correctly rounded result. We chose to make 
this right shift at least two hexadecimal digits. This choice 
produced correctly rounded results more than 95% of the 
time. 

The idea behind fast-track programming is to judiciously 
choose a series of cheap tests which filter out difficult 
arguments that rarely occur. These rare arguments require 
extra computing; our approach is to do more computing 
only when necessary. The normal arguments filter quickly 
into the main path of the code where the instructions are few 
so that me execution is fast 

Robustness and error handling are features that cost very 
little. Error handling is a rare event; we presume that it will 
not occur. Only when an error-producing argument is 
detected by the code do we set up me complete subroutine 
linkage necessary for a trace-back. Hence, most of the time 
we pay only the cost of a minimal subroutine linkage. In the 
text lo follow, we discuss error handling as part of fast-track 
programming. 

The term robustness refers mainly to delivering correctly 
rounded results for onnormalizcd arguments (22] and to 
being precise at overflow and underflow boundaries. These 
problems rarely occur in the FORTRAN environment. 
Again* we have cheap or no-cost tests to detect 
unnormalized operands. The benefit to the user is thai he 
knows he will never get garbage from the scalar demcntary- 
function routines; fox ail inputs in the domain of the 
elementary runction the result is correct within one idp- 
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• Tuckerman rounding 

The previous square-root tontines*, SQRT and DSQRT, 
usually rerurn the correctly rounded square root, but cot 
always. The new routines always do. by the following means, 
which we call Tuckerman rounding- 

If jc, p, y ere positive floau'ng'Painl numbers, such thai y is 
the (possibly unknown) nearest floating-point number to </r. 
and y is a "candidate" u> be % produced by some 
approximation, then y = y'\f and only if 

(y_+ v)/2< Jx<{y + yJP> 

(equalities cannot occur), where y_ and y„ are the fioating- 
poim numbers just below andjust above y. The terms in 
these inequalities cannot be evaluated directly on the 
computer, since Vr is irrational and unknown, and the other 
terms of the inequality are not floating-paint numbers of the 
given precision- However, these inequalities can be shown to 
be equivalent to 

>L ■ y < x * y * y+ . 

where « denotes Systcm/360/370 multiplication (which 
truncates the result), so that the tests are easily carried out 
without the need for extra precision. (Note the asymmetry, 
one <, one <.) If the left inequality fails, y is too large; if the 
right inequality fails, y is too small. 

U is convenient to first use an approximation which is 
known to give y in the range 

Jx - 1.5 ulp < y < v£ + 0.5 ulp. 

Then it is sufficient to test whether 

xxy-y.* 

where y+ = y 4 ulp (y). If the inequality holds, y = y\ 
otherwise, $-y+< 

A modification of these tests is used to achieve correct 
rounding in CABS and CDABS (absolute value of a complex 
number). Here, however, in some cases the true fimcrion 
value can lie exactly between two consecutive floating-point 
numbers; in these cases the hufcer is chosen. 

• Tublt'based approach 

A central idea in obtaining high accuracy for our elementary 
functions is the following: Let 

0) 



\corr\<z^\BXACT\, 



(2) 
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ef{x) = EXACT (x Q ) + corr U x& 

where e/stands Tor an crememaiy function, EXACT is some 
machine number that represents ef{x^ exactly or to higher 
precision, and corr is approximated by CORR, a small— 
usually two orders of magnitude smaller— machine number 
that is computed. The value is either not a machine 
number or some specially chosen machine number that 
makes efixj especially precise, where x - *<> is again about 
two orders of magnitude smaller than x and x* Suppose that 



Then an uip of e/t-*) is al kast 256 times an ulp of dorr, and 
several floating-point operations win, in worst case, 
contribute no more than, say, 5/256-ulp error to the 
computation of corr by CORK a polynomial minima* 
approximation to cert. Now [ CORR - corr\ < 5/256 ulp, 
and, if desirable, a rounding can be added to CORR. The 
final addition of the table value and the correction term 
always introduces an error in the range zero to one ulp (if 
both the terms are of the same sign) or —1/16 ulp to 15/16 
ulp (if they arc of opposite signs) because of the properties of 
the IBM System/370 floating-point Add and Subtract 
Operations. This is the largest single source of error in our 
compulation. Moreover, this error is biased with an average 
of 1/2 ulp (if the terms are of the same sign) or 7/16 ulp (if 
they are of opposite signs). We compensate for this bias by 
incorporating a compensatory term of 1 5/32 ulp (which is 
the average of 1/2 nip and 7/16 ulp) in CORR. This 
imperfect compensation of bias adds 1/32 ulp to oar overall 
error. Thus the final floating addition, EXACT + CORR* is 
made with absolute error £ 1/32 ulp, and hence (fix) can be 
computed with error no more than, say, 13/256 ulp. For our 
elementary functions these facts yield the correctly rounded 
^rtjf) most of the rime, and wiih error less than one ulp for 
ailx. For some functions, such as DEXP, the relative signs 
of the two terms are fixed, and for these functions wc can 
apply perfect compensation for the bias error. This results in 
99,8% correctly rounded results for DEXP. 

A consequence of (1). (2), and the Tact that cjix) is 
diffcrcntiable is that the set x^ of points needed to compute 
%f[x) for arbitrary x is about size 256, The function CORR 
approximating corr is a linear combination of polynomials 
in the variable bx — x - The constants in the linear 
combination are functions of x, x^, and Ax, and these can be 
easEy and cheaply represented in tables. An unexpected 
benefit of this approach is that CORR can be computed 
cheaply, mainly because the polynomials in &x have low 
degree. Thus the table-based approach simultaneously 
produces a method to compute ef{x) both more accurately 
and fester than the approach of using a single polynomial or 
rational approximation. 

• Near-correct accuracy 

All the functions produce results which are strictly less than 
one ulp away from the infinite-precision results. One 
implication of this is that if the irifinite-precision result is 
machine-reprcscntablc (i.e., it is a valid machine number), it 
is produced by these functions. This is particularly important 
for functions such as square root, absolute value of a 
complex number, and the power function. For these 
functions, the result is often an exact machine^epresentablc 
number, which our routines produce. Most elementary 
function libraries do not guarantee that. As an example, 
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X**LO will always be X for our functions, whit for some 
other libraries the result could be many ulps away from X, 
sometimes as ranch as hundreds of ulps away. Also, X~2.0 
usually produces a correctly rounded value, while X*X 
always produces the truncated value 

In computing the elementary functions, one can always 
obtain sufficient accuracy by doing arithmetic in higher 
precision. On most IBM System/370 machines, except the 
low-end machines, short-precision operations take about as 
much time as long-precision operations- Therefore, wc 
decided to take advantage of this to obtain good accuracy for 
short-precision routines. On all IBM System/370 machines, 
extended" precision routines are considerably more 
expensive. On the 3090 Vector Facility, they are not 
supported. Therefore, wc could not rery on extended- 
precision operations to deliver the desired accuracy for the 
Long-precision routines. The tables were designed 10 deliver 
the desired accuracy for the long-precisian routines. In most 
eases, the same tables were used to compute the short* 
precision functions. Except for EXP, short-precision routines 
were not independently designed. 

Since short-precision routines use mostly long-precision 
operations, the final rounding was easily and accurately 
accomplished by the LRER (load-rounded short-precision 
from long) instruction. This explains why most short- 
precision routines have close to 100% correctly rounded 
results. On the negative side, use of longHsrecision operations 
slows down short-precision routines on the lowend 
machines. 

■ Fast-trade programming and error Handling 
One of the standard practices in assembly language 
programming is to place at the beginning of the program a 
header or -eye-catcher" identifying the program. This is 
useful for trace-back in case of an error but slows down the 
code, as an extra branch is required. For the scalar 
elementary-function routines, where the total number of 
instructions executed is small, this overhead becomes 
significant. Since our routines are robust and do not produce 
an unexpected error (Such as an intermediate underflow or 
overflow), we have eliminated the header, which speeds up 
the execution for most or the arguments. In the rare case 
where the argument or result is out of range, the registers are 
appropriately modified to point to a correct header, so that 
proper error messages and error trace-back information can 
be given to the caller. This technique was used by 
Scarborough [I0J when he sped up the original IBM library 
designed by KukL 

In coding the scalar elementary-function routines, a great 
deal of effort was spent in minimizing the number of general 
registers used, as they need to be saved and restored, except 
that by convention the registers R0 and Rl need not be 
saved and restored. Therefore. RO can always be used and 
Rl can be used after it is no longer needed to fetch the 



arguments or for error traceback. Additional registers, if 
needed, are saved and restored in a donWc-word-ali&ned 
area. The registers are restored just before the final result is 
computed. This facilitates use of the conditional branch 
instructions to exit from the routine. In many situations, a 
test is made and for certain conditions we exit from the 
routine, or else we do more processing before exiting from 
the routine. 

Similarly, in coding the vector elementary functions, a 
great deal of effort was spent in minimizing the number of 
vector registers used, since they need to be saved and 
restored if they are in use, as signaled by the calling program. 
In coding these functions, we have optimized performance 
for what we consider a reasonable distribution of the 
arguments, although all arguments are properly handled. 
These "most-likely" arguments are handled in the main path 
of the routine with no branches or minimal branches taken. 
The unusual arguments which require extra processing or 
somewhat different processing are quickly filtered out and 
processed by branching to ctitTerent segments of the program. 
The program flow can be described as a tree structure. The 
very first test in the main path usually routes most of ihe 
unusual arguments to some point of control. At that point, 
further tests arc done to properly classify and process the 
argument. This minimizes the number of tests to be made in 
the main path of the code, leading to an efficient 
implementation for the most likely arguments. 

As mentioned above, for some arguments extra processing 
may be required. This extra processing may take the form of 
some preprocessing and postprocessing, with essentially the 
main path code in between. In this situation, after the 
problem argument has been testified and prcproccssed, we 
save R 14 (the return address register) and modify it to point 
to a postprocessing labeL Then we branch into the main 
path, where after all the processing BR 14 automatically 
branches to the postprocessing label. There, extra processing 
is carried out, and R 14 is restored to its original value for 
exit fro in the routine. This procedure eliminates the need for 
additional tests in the main path of the code. 

The vector functions do not handle error reporting for the 
arguments. The arguments which require error messages arc 
handled by the scalar code. In the vector code, some quick 
checks are made for the possibility of error reporting, and in 
thai case, either all the arguments or some of the arguments 
are processed by calling the corresponding scalar function. 

The vector-mask-mode instructions are heavily used to do 
conditional operations, which in the scalar code are 
normally handled by branching. By doing a few mask-mode 
operations we are able to combine many main paths of the 
scalar code into one single vector code. This helps in 
maintaining close to foil vector lengths for all the operations. 
Where ar^ropciatc. vector-store-comprcssed (VSTK/ 
VSTKE/VSTKD) instructions are used to separate the 
arguments into two or more cases or to collect the 
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exceptional arguments. The separated arguments are 
processed in either the scalar cr the vector mode depending 
on the specific situation, and the results are inserted into the 
rcsnh vector by vector-load-expanded (VLY^VLYE/VLYD) 
instructions. 

Care is token not to produce any intermediate underflows. 
This problem can sometimes be avoided by masking out the 
underflow mask bh in the program status word (PSW). For 
the scalar code, where only one argument is processed in 
each calk the extra cycles required 10 mask out the 
underflow bit and to restore it on exit are not justified, and 
so other coding precautions arc taken. But in the vector 
code, where many argument* are processed in a single catU 
the extra cost of saving and restoring the underflow hit is 
justi6cd. This is done for some orthc vector functions to 
improve performance. 

• Robustness 

The new Functions handle unnoTmalized arguments correctly 
in scalar mode. For unnormaJized arguments, a normalized 
result is produced with an accuracy of one ulp (or 0.5 vlp, in 
the case of perfect rounding). (For the functions with one- 
ulp accuracy, in rare cases, chis result may be one ulp away 
from the result produced by the equivalent normalized 
argument.) The previous library did not always handle 
unpormalizcd arguments correctly. Oficn the results were 
meaningless. 

Another feature of the library is that the argument and 
result boundaries are strictly observed. For example, for the 
power function, if the result is wiihin the range of machine- 
repirscntable numbers (Le.. it does not underflow or 
overflow), it is always produced without any error messages. 
In the previous VS FORTRAN library, for a large number of 
machine-represcntabte results close to underflow/overflow 
boundaries, an underflow or overflow message was issued 
and the result was not computed. This was true for many 
functions, For ail the new functions, if the exact result is 
within the underflow and overflow range of the machine, it 
is computed with one-ulp accuracy. Furthermore, in the 
computing process, intermediate underflows and overflows 
are strictly avoided. This has been achieved by appropriate 
scaling, if the possibility of an underflow or overflow exists. 
After the scaled result is computed, it is examined to see ir 
the final result is going to be within the machine- 
Tepresentable range. The previous library produced 
intermediate underflows for many functions. 

4. Performance 

• Speed 

In this section we compare the speed and accuracy of the 
new elementary Functions to those in ihe present libraries 
(VFORTLIB and VALTLIB). 

TflB les 4-7 present information on speed. Table 4 gives 
ihe time per call (in microseconds) for the new functions. 



averaged over 10000 random arguments on the machines 
indicated. These times were obtained by timing a 
FORTRAN loop like 

DO 1 J = 1,N 
I Z(J) = F(X(J)> 

compiled with VS FORTRAN 1 AO (with A r = 10000) and 
dividing the resulting time by M The lime listed is the 
median of three trials (with different random arguments). 
This procedure means that the times include loop and 
subroutine-linkage overhead. To give the reader an 
indication of the magnitude of this overhead we have 
included "dummy" times, which arc timings for functions F 
which consist of an immediate return to the caller (a BR 14). 
There are six of these, depending on whether F is a single- or 
double-precision runction of one or two real or one complex 
argument. For some reason* in the DO-loops we used, the 
FORTRAN compiler treats X(J)**Y(J) differently from 
other functions of X{J) and Y(J): hence, the two-argument 
dummy times only apply to (D)ATAN2- Timings for the 
new functions depend on what distribution of arguments is 
assumed. We have attempted to choose representative 
distributions. The times will increase when scaling is 
required to avoid intermediate underflow or overflow. For 
the inverse trigonometric functions, we assume that the 
result is uniformly distributed in the indicated range. The 
logarithmic distribution means that the arguments are 
chosen in the indicated interval in such a way that their 
logarithms are uniformly distributed. The polar distribution 
means that we choose a pair of arguments r cos 0, r sin 0 $o 
that r is logarithmically distributed in the indicated mnge 
and 6 is uniformly distributed in (0. 2*). For tbe power 
runction, X varies as indicated while Y is held constant. For 
thc trigonometric functions, the logarithmic distribution in 
effect averages the speed on large arguments (which require 
precise argument reduction) wiih the speed on small 
arguments (which may take a special fast track through the 
code). So that the interested reader may separate these 
ranges, we have indicated below each of these times what 
percentage of the total time is contributed by arguments in 
the lower and upper halves of the logarithmic range, 
respectively. This separation is not always possible for vector 
routines, but in these cases it is roughb correct. 

Table 5 lists the ratio Of the VFORTLIB times to the new 
function times (as given in Table 4). Since mere arc no 
vector functions in VFORTLIB. Column 7 is the ratio of the 
times for the scalar routines in VFORTLIB to the times for 
the new vector routines. Table 6 similarly compares the 
VALTUB times to those of the new functions. We include 
in Table 6 all the functions in V ALTLIB. 

Table 7 gives three sets of ratios. Column I b the ratio of 
the times for the old library on the 3081 KX to the times for 
the new scalar library on the 3090. Column 3 is the ratio of 
the times for the old library on the 3081 KX to the times for 
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FUNCTION DISTRIBUTION < RANGE J 



4331-2 4367-5 4361-2 4361-3 3061KX 3090-S 3090-V 



EDUM 
E2DUM 

COUM 

DOOM 
D20LTM 
OODUM 

ACOS 

AS IN 
DASIM 
ATAN 
DATAK 
AT AN 2 
DATAN2 
COS 
COS 

DCOS 
DCOS 

SIN 
SIN 

D5IN 

TAN 
TAN 

OTAN 
□TAN 

COTAN 
COTAN 

DCOTAN 
DCOTAN 

EXP 
EXP 
DEXP 
DEXP 
ALOG 
DLOG 
ALOGlO 
DLOG10 

SQRT 
OSQRT 

x"V 

DX"Y 
DX**V 

CABS 
CDABS 



CIRC<0,PI) 
CIRC (0, PI) 
ClRC(-PI/2,PI/2) 
CIRC (-PI/2 , PI/2 ) 
TAN (-PI/2, PI/2) 
TANOPI/2,Pl/2) 
P0LAR(16**-1 6, 16**16) 
POLAR ( 1 6* *- 1 6 , l 6** 1 6 ) 
LINEAR (-PI, PI J 
L0G(2**-18*PI,2**1B*PI) 
SUBRANGE PERCENTAGES 
LINEAR {-P I , PI) 
LOG(2**-5C*PI, 2**5Q*PI) 

SUBRANGE PERCENTAGES 
LINEAR (-P I, PI) 
LOG <2**-18*PI, 2**1 8*PI) 
SUBRANGE PERCENTAGES 
LINEAR ( -PI, PI) 
ljQG(2 M *-5a*PX,2* l *50*PI) 
SUBRANGE PERCENTAGES 
LINEAR (—PI/ 2, PI/2 J 
LOG<2'*-18*PI,2**18*PI) 
SUBRANGE PERCENTAGES 
LINEAR (-PI/2, PI/2) 
LOG<2"-39*PI,2»*39*PI) 
SUBRANGE PERCENTAGES 
LINEAR<-PI/2,PI/2J 
LOO(2**-1B»FI r 2"«18"PI) 
SUBRANGE PERCENTAGES 
LI NEAR (-PI/2 , PI/2 ) 
LOG(2**-39*PI,2**39*PI) 
SUBRANGE PERCENTAGES 
LINEAR t-1 00 , 100) 
1INEAR(-16, 16) 
LINEAR (-100, 100) 
LINEARl-16,16) 
LOG<l6**-65, 16**63) 
LOG( 1 6* »-65, 16**63) 
LOG(16«»-65, 16**63) 
LOC(16«--65, 16**63) 
LOC06**-6S,16*-63) 
LOG(l6*»-65, 16**63) 
LINEAR(.1,10)**60 
L0C(16**-6S,16**63)**.7 
LI NEAR ( . . , 10) **60 
LOG ( 16«-65 , 1 6**63 ) . 7 
POLAR 1 16*-- 16, 16" 16) 
POLAR(l6**-16 f 16-*16> 



9.90 
12.21 
10.58 
10,71 
13.47 

11-41 

179,62 
496.24 
174,29 
461.10 
102.96 
360,42 
152.27 
^63. 13 
102.36 
1 6 1 . 64 
(32,66) 
346.94 
377.66 
(29,71) 
105.72 
150.54 
(24,-76) 
339.79 
3*32.68 
(28,72) 
143.20 
176.30 
(28,72) 
442.53 
4Q1.B5 
(25,75) 
142.92 
197.9*3 
(36,64) 
442. 98 
620.00 
(42,58) 
227.40 
254.82 
445.59 
445>.43 
232.36 
431 .27 
280.06 
594.90 
153.86 
279.65 
481 .5? 
480.91 
1032.14 
1030.97 
233.07 
639.55 



3.07 
3.93 
3.17 
3.54 
4.63 
3.74 
33.29 
61.73 
30.72 
56.79 
25.62 
53.49 
39.41 
72.55 
23.33 
39.94 

(30,70) 
02.49. 
53.85 

(29,72) 
24.13 
36.88 

(22,78) 
40.95 
59.26 

(21,79) 
33.11 
32.08 

(33,67) 
65.01 
66.56 

(20,80) 
33. 18 
39.27 

(44,56) 
64.96 

104. I* 

(33,67) 
20.35 
39. 14 
76.55 
76.55 
48.66 
74,66 
53.23 

105.90 
25.65 
52.75 
81.91 
81.93 

125.99 

125.9a 
33.96 
BO. 07 



4.B8 
5.66 
5.18 
5.05 
5,72 
5.12 



33.23 
64.47 
34.59 
62.68 
28.93 
51. 19 
41.78 
67. 17 
25.68 
34. 13 

(37,63) 
46.32 
49.98 

(31.69) 
27.08 
32. 14 

(32,68) 
46. 18 
51 . 10 

(30,70) 
34.57 
36.64 

(36,64) 
56.57 
62.72 

(29,72) 
34.29 
39.51 

<40,60) 
56.41 
76.84 

(41,59) 
32.99 
37.46 
65.91 
65.73- 
37.84 
63.06 
45.44 
60.92 
34.63 
46.95 
73.37 
73.54 

128.40 

129.72 
46.76 
86.04 



2.19 
2.61 
2.43 
2.22 
2.72 
2.53 
12.57 
20.01 . 
13.05 
19.21 
12.81 
19.70 
18.79 
26,30 
9.99 
12.37 

(38,62) 
14.77 
16.18 

(32, 6B) 
12.66 
11.90 

(31,69) 
14.26 
16.99 

(30,70) 
14.05 
13.97 

(38,62) 
21.02 
23.68 

(27,73) 
14.49 
15.51 

<4S,55) 
21.36 
28.95 

(40,60) 
8.56 
10.25 
21.68 
21*67 
13.47 
20.69 
IS. 39 
25.23 
13.19 
19.94 
25.44 
25.69 
41 .89 
41.62 
17.40 
31 .49 



.95 
.96 
.84 
.85 
1 .00 
.86 
4,56 
7.59 
4.50 
7.26 
4.00 
6.63 
5.83 
9.23 
3.51 
4.48 

<30,62) 
S.83 
6.20 

(33,67) 
3.61 
4.17 

(34,66) 
5.61 
6.26 

(32,68) 
4.75 
4.97 

(37,63) 
7.70 
B.07 

(31,69) 
4.72 
5.37 

(42,58) 
7.68 
9.62 

(43,57) 
3.51 
3.8S 
7.30 
7.30 
4.40 
7.07 
5.28 
9. 16 
a. £0 
6.46 
8.62 
8.63 
15. 3B 
15,39 
5.97 
10.78 



.46 
.55 
.47 
.49 
.56 
.50 
2.06 
3.09 
2. 14 
3.01 
2.04 
2.99 
2.94 
4.21 
1,56 
2.07 

(38,62) 
2.23 
2.53 

(34,66J 
1 .73 
1.97 

(35,65) 
2.28 
2.60 

(32,68) 
2.28 
2.27 

(38,62) 
3.30 
3.76 

(29,71) 

' 2.28 
2.55 

(44,56) 
3.27 
4.49 

(<ao,60) 

1.74 
1.97 
3.04 
3.03 
2.01 
3.05 
2.46 
3.56 
2.15 
3.21 
3.69 
3.69 
5.76 
5.75 
2.66 

4.64 



1.41 
1.68 
1.76 
2.16 
.82 
.82 

(46,54) 
1.15 
2.02 

(25,75) 
.86 
1.28 

(66.34) 
1.09 
2.71 

(43,57) 



2.29 
3.39 
(36,64) 



2.28 
4.11 
(47,53) 
.62 
.62 
.89 
.88 
.74 
.94 
.77 
1.14 
.68 
.94 
1.26 
1.26 
2,18 
2.18 
.92 
1.81 



the new vector library on the 3090 (with Vector Facility)- 
For foil vector lengths, a user win get this gain when he 
moves from the to the 3090 Vector Facility. 

Column 2 is the ratio or the times of the new scalar library 
to the times of the new vector library on the 3090- 

Genenrting precise times is difficult, since seemingly 
inconsequential changes in the timing procedure may nave a 
noticeable effect on the measured times. For example, on the 
3Q61KX the performance of the STM and LM instructions 
is severely degraded near page boundaries. This means that 
m the rare event that the save area of a subroutine is near a 



page boundary, the speed of execution of the subroutine will 
be substantially decreased. Also, the functions timed are not 
necessarily the final versions, as changes are being made. For 
these reasons the reader should interpret Tables 4-7 as 
giving a general indication rather than a precise 
measurement of what performance (hi terms of speed) users 
may expect from the new library. 

♦ Accuracy 

Tables 1-3 give accuracy figures for the new library. 
VFORTLJB, and VaLTLIB, respectively. For each function 
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Table 5 ft»U*s of VS FORTRAN Version I library mcasuiwnente lo Vcmon 2 counlc*paf«. 



FUNCTION DISTRIBUTION ( RANGE) 



ACOS ClFCtO.PH 
DACUS CiKC(0,*I5 

ASIN CIRC(-PI/i.Pl/2) 
DASIM CTRC(-PI/2,PI/2} 

ATAN TAN(-PT/2,P:/21 
DATAN TAN (-PI/2.P 1/21 
AT AN 2 POUUU 16»*-16, 16»*16) 
DATAN2 P0LAR(16*"-16, U**"»6) 
0)5 LINEAR I -PI. PI) 
COS LQG<2V-ie-PI,2*»1B*PI) 
DCOS UNEAA(-PI,Pr.) 
DCOS IjOG(2**-!>n i, Pl , 2**50*PI) 
SIN LINEAR* -PL PI) 
SIN L0G(2*'-1B-PI,2«18*PI) 
DSIN LlNEAR(-PI r Pl) 
DSlN LOG(2*»-50*PI,2*»S0»PI) 
TAN LlNEAR(-PI/2,Pl/2> 
TAN LOG(2*»-1B'PIr2**18*PI) 
DTAN LINBAR[-Pi/2 r PI/2) 
DTAN LOG(2**-39»Pl,2»*39*PI) 
COT AN LINEAR{-PI/2,PI/2) 
COT AN L0G(2**-1B*PT*2*»18*PD 
DCOTAN LINEAR! -PI/2, PI/2) 
D COT AN LOG(2*»-39*PI # 2**39*PI) 
EXP LINEAAI-100,100) 
EXP HN£AR(-16, 16) 
PEXP LINEAR* -100, 100) 
DEXP LINEAR(-16 t 161 
ALOG L0G(16-»-65, 16**631 
DLOG LOG O 6* »-6S, 16**63) 
ALOGlO LOGM6»*-65,16"63) 
DUX310 L0G(16**-6S, 16**63) 
SQRT LOG (16" "'65. 16**63) 
DSQRT LOG M6**-65, 1£**63) 
X**Y LINEAR i , 1 , 1 0) ■■60 
X»«Y LOG<16-*-65, 16**63)*-,"i 
DX*»Y LINEAR{ .1,10) **60 
OX*»Y LOG(16*»-65,16"63j**.1 

CABS POLAR { 1 6**-l6, 16**16) 
CDABS POLAR( 16«*-l6, 16**161 



4331-2 4361-S q3«1-2 43B1-3 308 1 Kjj 3090-5 3090-V 



1.16 


1 .86 


1.89 


y. 


53 


2.13 


i.i'l 




1.37 


2.71 


1.64 


2. 


71 


2.07 


2.57 




1.20 


2.02 


1.79 


2. 


39 


2.16 


2.26 




1 .41 


2.94 


1.67 


2. 


80 


2.17 


2.63 






1.61 


1.67 


1 . 


67 


1.72 


l .59 


2. 30 


1,59 


2.25 


1.09 


1.96 


1 .74 


1.07 


3.33 


1.28 


1.32 


1.43 


1.18 


1 .5^ 


1 .ha 


2.40 


1 .42 


1 .97 


1 .44 


1. 


;« 


1-55 


1 .65 


3.22 


1.63 


1.25 


1.57 


1 . 


A} 


1 .42 


l .42 


2.71 


.98 


,60 


1 .10 


1 , 


06 


1 .0^ 


1 .01 


2.55 


1.55 


1.5B 


I.a5 


1 . 


51 


1 .39 


1 .UB 


2 .R6 


1 .44 


1.29 


1.27 


1 


38 


1 -2S 


1 .28 


1.61 


1 ,ss 


1.20 


1.o9 


i 


1 1 


1 . 37 


1 .27 


2 .55 


1.'0 


.76 


1.21 


l 


14 


1 . 14 


1 .08 


l . 66 


1.59 


1.64 


1 .4b 


1 


57 


1 .40 


1 .-04 


3.02 


1.4* 


1.15 


1 .24 


1 


.29 


l . 22 


1 . 23 


1.18 


1 .41 


1.23 


1 .46 


1 


.«a 


i .48 


1 .03 




1 . 28 


1 . 44 


1 cc 

1 . 3D 


1 


.66 


1 63 


1 72 




1.2B 


1 .30 


1 .36 


i 


.42 


K33 


K38 


1.99 


1 . n 


.97 


1.23 


1 


.28 


1.26 


1.20 


1.34 


1 .44 


1 .25 


1.52 


1 


.46 


1.S0 


1.47 




1.17 


l .25 


1 .50 


1 


.56 


1 .57 


1 -6C 




1.J9 


1.32 


1,39 


1 


.46 


1 , 36 


1.42 


2.03 


.92 


.83 


1.03 


l .09 


1 .09 


1 .05. 


1 . 14 


.7S 


1.81 


1.32 


a, 42 


1 .90 


1 .84 


5.18 


.65 


.92 


1.15 


1 


.99 


1 .70 


1.61 


5.11 


1,18 


i.oe 


1.16 


i 


.12 


1.10 


1 . 13 


3.85 


1.18 


1.07 


1. 16 


l 


.10 


1.17 


1 .12 


3.84 


.74 


.91 


1.24 




.an 


1.43 


1 .49 


4 .04 


1 .35 


1 .57 


1.18 


i 


.45 


1.37 


1 .46 


4.73 


.$* 


.BG 


1.11 


1 


.33 


1 .24 


1 .20 


3.96 


1 .05 


1 .13 


.w 


\ 


.21 


1.10 


1 .28 


4.00 


.70 


1 .34 


.96 


1 


.42 


1 .22 


1 .26 


£ , Q-Q 


.96 


1.46 


1.08 


1 




1.21 


1.27 


o.33 


.72 


.96 


1.20 


1 


.53 


1 . 44 


1 .56 


4 .58 


.72 


.96 


1.20 


1 


.51 


1 .43 


1.57 


4 .60 


1.12 


1.59 


1.17 


1 


.25 


1 . 17 


1 .29 


3.40 


1.11 


1 .57 


1.16 


l 


.27 


1.16 


1.29 


3.39 


.92 


1 .Hb 


1.32 


i ,8a 


1 .65 


1.07 


5.40 


.79 


1.55 


.97 


1 


.34 


1.20 


1.42 


3.64 
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J 0000 random arguments were generated according to the 
indicated distributions as defined above. The first column 
gives the percentage of arguments for which the function 
returns the correctly rouncled result The second column 
gives the average absolute error in ulps. Note that a perfect 
routine will have an average absolute error of 0.25 ulp if the 
residual errors arc evenly distributed between -0.5 and 0.5 
ulp. The third column gives the value of the 100m largest . 
absolulc error in ulps. (Note that wc would expect the 
function to return a value in error by less ihan the amount 
in the third column 99% of the time.) For all columns «c 
assume that the enact value is returned by the function of 
next higher precision. The tan and cotan functions in 
VFORTUB and VALTLIB give err0r returD ***** dasc 
10 a singularity (while ours compute a result). Cases where 
this occurs have been omitted in computing Columns 1-3 
for these functions. It should be noted that although the 
VFORTL1B (and to a lesser extent the VAl.TUB) functions 
appear to have very poor accuracy for the double-precision 
trigonometric functions on large arguments, a one-utp error 



(even a rounding error) in a large argument produced by a 
user can cause a propagated error even in the exact result 
which is comparable to that generaied by the corresponding 
VFORTUB function. 

Again, these tables should be taken as a general indication 
rather than a precise measurement or what performance (in 
terms of accuracy) the user can expect from the new library. 

- Utppfots 

Figures 1-9 give ulp plots for ihrce functions; DTAN in the 
range - ir/2 to jt/2 (Figs. 1-3), DSlN log distribution in the 
range 2T*V to 2*V (Figs. 4-6), and the short-precision 
X"Y, whh y = 60.1 and X linearly distributed in the range 
0. J to IO.0 (Figs. 7-9). The plots are given for all three 
libraries discussed in this paper: the new VS FORTRAN 
library (Figs. 1 . 4. and Ty, the old VS FORTRAN library 
(Figs- 2. 5. and 8). and VALTLIB or the Wang library (Figs. 
3, 6. and 9). These are scatter plots with randomly 
distributed arguments along the jr-axis and the 
OTTOponding ulp error along the v-axis. Note that when the 
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Tabid 6 Ratios of v ALT LIB library measurement* to Version 2 counterparts. 



FUNCTION 


DISTRIBUTION (RANGE) 


4331-2 


4361-5 


4341-2 


4301-3 


308 i KX 


3090-5 


3090-V 


DCOS 


LiS£Aft<-PI,PI) 


2,09 


2.55 


2.01 


2.30 


1.96 


1 .49 


2.88 


pcos 


LOG (2**-S0*P3 , 2*»50"PI) 


2,04 


2.16 


1 .64 


2.17 




1 .29 


1.61 


DSfN 


LINEAB<-P1,PU 


2.15 


2.65 


2.03 


2.43 


2.07 


1 .44 


3.02 


DSIN 


L.OG(2»*-50»PI ,2-*50»PI) 


2.05 


1 .93 


1 .60 


2.09 


1.8S 


1.23 


1 .IB 


DTAN 


LIKEAR(-Pl/2,Pl/2) 


1.71 


2.00 


1.81 


2.04 


1.79 


1.3© 


1 .99 


DTAN 


LDG(2**-39*PI,2**39*PI) 




1 .55 


1.6B 


1.89 


1.7* 


1.20 


1.34 


DCCTAH 


LI NtiAR (-PI/2 , PT/2) 


1.72 


2.02 


1.63 


2.09 


1.B0 


1.42 


2.03 


DCOTAN 


LOG(2**-39*PI ,2«»39*PI) 


1.29 


1.31 


.1.40 


1.57 


1.49 


1 .04 


1_14 


EXP 


LXHEARt-lOC, 1QOJ 


»ft1 


2.09 


1.39 


2.62 


1.93 


1 .84 


5.18 


FtXP 


LINEAR t- 16 , 16) 


.72 


1 .09 


1.25 


2.25 


U77 


1 .61 


5.11 


DEXP 


LINEAR (— 100 , 100) 


1 . 16 


1.15 


1 .1H 


1.2* 


1 .28 


1.13 


3.81 




LINEAEM-16,16) 


1.16 


1.16 


1.20 


1 .28 


1.29 


1.12 


3.84 


x**y 


LINEAR ( . 1, 10) "60 


.75 


1 .51 


1.40 


1 .99 


1.76 


1.57 


4.59 




L*H3<16»»-G5, 16**63>*». 7 


.75 


1.35 


1.41 


2-01 


1.76 


1 .57 


4.60 


dx**y 


LI WEAR (.1, 1C)»*60 


1 .27 


1 ,6« 


1.32 


1 .58 


1.49 


1.67 


£1.(1} 


DX**Y 


LOC( l6**-65, 16*" 63 J •*.7 


1.26 


1 .63 


1.32 


1 .58 


1.49 


1.67 


4.42 



ulp error is Urge, ix is plotted on the tog scale (Figs. 2, 5, 6, 
and fi). The colored line represent* 0.5 ulp error. For the 
linear ulp scale, all the points within the two colored lines 
represent correct rounding. For the logarithmic ulp scale, 
absolute value of the ulp error is plotted, and therefore all 
points below the colored line represent correct rounding. 

The plots tor the new library are essentially Featureless, as 
most of the arguments are correctly rounded. The only 
noteworthy feature is in Fig. 4, where for small arguments 
the ulp error is essentially zero. For these arguments, sin (x) 

x is a very good approximation, resulting in almost zero 
ulp errors. Although these plots look featureless on the 
present scale, they will show many features and peculiarities 
if magnified by a factor or 256 along the y*xis with 
appropriate magnification along the x axis to show various 
table intervals. 

The old VS FORTRAN library shows large errors due to 
argument reduction, as indicated in Fig. 2. near ±jt/2. As the 
argument approaches a multiple of t/2, the ulp erroT keeps 
increasing. The argument reduction error is amplified in Fig. 
5. where wc see a definite pattern. The staircase partem is 
caused by the hexadecimal arithmetic, and the steps an; a 
factor of 1 6 apan in both directions- The points above the 
staircase represent arguments near multiples of Another 
interesting feature of Fig. 5 is the spikes for small arguments, 
again at intervals of a factor of 16. These spikes are 
explained in the subsection on ulp plots in Section 2. 

Next, wc examine the equivalent plots for VALTUB. In 
this library the argument reduction is done more accurately. 
But as also in the old VS FORTRAN library, even for those 
arguments where the argument reduction is not needed, we 
find errors of up to 16 ulps, for DTAN in Fig. 3- The 
increased accuracy in argument reduction is more evident in 
Fig. 6. where the staircase phenomenon is delayed to much 
larger arguments. This indicates that more digits oF v are 
used in the argument reduction. Even for this library, if 



DSIN or DTAN is examined tor arguments very dose to 
very large ulp errors would be observed. The rest of the 
features of Fig. 6, including the spikes for small arguments, 
are very similar to those of Fig. 5. 

In the power function, when y log (x) is large, it is very 
important to calculate log (x) and v log (x) to a higher 
precision, to ensure good accuracy in the final result This 
apparently was not the case with the old VS FORTRAN 
library, as indicated by Fig. 8. figure 9 shows that this 
problem was considerably reduced in VALTLIR. Even for 
VALTLIB, experiments with very large exponents revealed 
errors of up to 1 00 ulps in short precision and several 
hundred ulps in long precision. The plot of Fig. 8 shows 
discontinuities very typical of hexadecimal arithmetic, when 
the result crosses to a new exponent. The new short- 
precision power function (Fig, 7) looks perfect, although it is 
not, 

5. Conclusions 

Wc have presented a description of a new practical theory 
for producing very accurate high-performance scalar and 
vector elementary functions for System/370 IBM machines. 
Our overall approach is to use tables in which we represent 
each function as ao exact Quantity plus a small correction. 
Although this idea is not new, we have shown, by a number 
of novd methods, that such an approach can be made 
extremely accurate and very last. In contrast, the previous 
approach of using rational functions as a minirnax 
approximation is not as accurate nor as fast. However, this 
older approach was developed when preservation of 
computer storage was important and the cost of a divide 
instruction compared to that of a multiply instruction was 
not large. 

The development oFthe table approach followed rather 
naturally from the new requirement to produce vector 
algorithms. The requirement of producing bitwise- 
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Table 7 

CoKim* 2: VS FORTRAN Version 2 ****** on 3Q9U 
limes on 30BI vcrtu* Version 2 vector on 309a 



FUNCTION DISTRIBUTION (RANGE) 



EDUH 
E20UM 

DDUH 
D2DQK 
CDDUM 

ACOS C1RC(0,PJ) 
PAC05 CIBC(0,P1) 

ASIN CIRC PI/2 
DAS IN CIRC (-PT/2- "/J) 

ATAN TAN I -PI/2, P 1/2) 
DATA* TAK(-P 1/2. PI/2) 
ATAN2 PC>LAB(16— 6.16"16 
DATAN2 FOLMine'-l* 
COS Lr*lEAR<-PI*PE> 
COS D0G(2**-1B*PI,2-1B*PD 

DCOS LINEAR (-Pi / p * ^ ^ CA-fvr , 
DC05 LOC(2«— SO»PI,2»*50*PI) 

SIN lXNEAR(-F*iPI> ^...^ 
STN LOCS<2*»-'B*PJ.2**1B*Pl) 

DSIN LINEAR PI) rrtik „ T , 

DSIM LOG(2—S0-PI f 2^50*PI) 

TAN LOG(2-*-1B*PT,2**1B*Pl) 
OTAN LINEAR I -PI/2, PI/2) 
DTM* L0G<2..-39'PI,2»*39-PI1 
COT AN LINEAR (-Pt/2,Pl/2> 
COTAN ^(2 M - 1fl *"' , / Q4FI) 
DCOTAN t-™E^<- p ^ 2 ' P * /2 > 6T| 

£XF LTNE*R(-103/ 100 > 
£XP L1NEAR(-1B,16) 
OEXP LINEAR (-100,100) 

OUQG LOG(16**-65.16"G3 
AI-OG^O L0G(1B"-65,16**63J 
DLOG10 L0G(l6—65/l6«63) 

SQRT LOCt16»*-65,l6**63 
DEQRT L0G(16»--6S.16*"63) 

X*«Y LINEAR{-1.10>**60 

DX»*Y LINEAR (-1. 10 >**«»0 

CABS POLAR(l6»-l|*16*M6 
CDABS P0LAH.(16* — ie,T6»*l6) 



J0B1K 
3090S 



3090S 
3090V 



1.R3 

1.75 

1.79 

1.16 

1.77 

1.72 

4,72 

5.09 

4.5* 

5.23 

3.36 

3.87 

3.05 

3.40 
3.1& 
2,24 
3.62 
3.06 
2.86 
2.41 
3.5U 
2.9Q 
3.09 
3-58 
3. TO 

2.ee 

3.1B 

3,30 

3.20 

2.33 

3. S3 

3.32 

2.B6 

2.82 

3.12 

3.18 

2.67 

2. S3 

2,50 

2.43 

3.36 

3.35 

3.11 

3.10 

3.70 

2.78 



i.a5 
1 .78 
1 .67 
1.95 
1 .90 
2.52 
1 .94 
1 .2S 
2.01 
1 .54 
2.09 
.96 



1 .44 
1 .12 



1.43 

1 .09 
2.81 
3. IB 
3.42 

2 44 

2.72 
3.24 
3.19 
3.12 
3.16 
3.41 
2.93 
2.93 
2.64 
2.64 
3. 89 
2.56 



3061K 
3090V 



4.89 
6.88 
5.10 
6.62 
6.07 
5.66 
7.02 
3.83 
5.74 
3.71 
?.«0 
2.82 



4.46 
2.99 



4.59 
2.54 
10.76 
10.56 
9.75 
9_72 
B.47 
10.31 
8.52 
B.82 
7.9*. 
8.31 
9. S3 
9. BO 
B. 22 
8.17 
10.71 
7.13 
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compare scalar algorithms that also petfood well came 
about when we achieved the capability of viewing difficult 
and or special arguments as rare cases so thai the main oodc 
sueam had short paihs with almost full vector lengths. These 
5 pccial situations arc quickly detected by very ™*™ vt 
tests and handled in the scalar mode. At time* we modified 
algorithms shghtly to achieve better performance on *c 
v^r machine whik maintaining the scalar eornpauWixy;. 
this was necessary because there is no vector counterpart for 
many scalar instructions. The rdadve timing of vanous 
Actions in vector and scalar mode fa amtc |d*«* 
which leads to changes on a vector implementation, torn* of 
SSolc* are somewhat expanded ford^ ^^n to 
achieve better performance. We have W 1 ^^*. 
vector midftfr-dd instruction wherever possible, especially 



in polynomial evaluations. In the vcOot mode, it is cheaper 
(on a per-elcment basis) vo mask off underflow. This has 
been done for some functions to improve performance. 
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